The 2019-2020 Coronavirus Pandemic Analysis

Contact: Smith Research

BACKGROUND & APPROACH

I wanted to track and trend the coronavirus outbreak out of my own curiosity. Some interesting questions may fall out of this: it is a historic moment, scientifically and analytically (a large amount of data is being shared across the globe and analyzed in real time), and the world has come to a halt because of it.
This analysis attempts to answer the following questions (more to come):

  1. What does the trend of the pandemic look like to date?
  2. What are future case predictions based on a historical model?
  3. What interesting quirks or patterns emerge?

ASSUMPTIONS & LIMITATIONS: * This data is limited by its source. I realized early on that the number of cases conflicted depending on the source. Originally I was using JHU data, but it was always ‘ahead’ of Our World In Data. I also noticed that JHU’s website was buggy: clicking on the U.S. stats didn’t actually reflect the U.S. So I changed data sources to be more consistent with what is presented in the media (and Our World In Data has more extensive plots I can compare my own to). An interesting aside might be: why the discrepancy? Was I missing something?
* Definitions are important, as is the idea that multiple variables accumulate in quantities like total cases (more testing, for example).

SOURCE RAW DATA: * https://ourworldindata.org/coronavirus
* https://github.com/CSSEGISandData/COVID-19/

INPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/data)

OUTPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/results)

TIMESTAMP

Start: ##—— Tue Jun 23 23:13:49 2020 ——##

PRE-ANALYSIS

The following sections are outside the scope of the analysis proper, but are still needed to prepare everything.

UPSTREAM PROCESSING/ANALYSIS

  1. Google Mobility Scraping, script available at get_google_mobility.py
# Mobility data has to be extracted from Google PDF reports using a web-scraping script (Python, written by Peter Simone, https://github.com/petersim1/MIT_COVID19)

# See get_google_mobility.py for local script 

python3 get_google_mobility.py
# writes csv file of mobility data as "mobility.csv"

SET UP ENVIRONMENT

Load libraries and set global variables

# timestamp start
timestamp()
## ##------ Tue Jun 23 23:13:49 2020 ------##

# clear previous environment
rm(list = ls())

##------------------------------------------
## LIBRARIES
##------------------------------------------
library(plyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plot.utils)
library(utils)
library(knitr)

##------------------------------------------

##------------------------------------------
# GLOBAL VARIABLES
##------------------------------------------
user_name <- Sys.info()["user"]
working_dir <- paste0("/Users/", user_name, "/Projects/coronavirus/")  # don't forget trailing /
results_dir <- paste0(working_dir, "results/")  # assumes directory exists
results_dir_custom <- paste0(results_dir, "custom/")  # assumes directory exists


Corona_Cases.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
Corona_Cases.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"
Corona_Deaths.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv"
Corona_Deaths.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

Corona_Cases.fn <- paste0(working_dir, "data/", basename(Corona_Cases.source_url))
Corona_Cases.US.fn <- paste0(working_dir, "data/", basename(Corona_Cases.US.source_url))
Corona_Deaths.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.source_url))
Corona_Deaths.US.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.US.source_url))
default_theme <- theme_bw() + theme(text = element_text(size = 14))  # fix this
##------------------------------------------

FUNCTIONS

List of functions

function_name       description
prediction_model    outputs a case estimate for given log-linear model parameters (slope and intercept)
make_long           converts input data to long format (specialized cases)
name_overlaps       outputs the column-name intersection and set differences of two data frames
find_linear_index   finds the first date at which linearity occurs
##------------------------------------------
## FUNCTION: prediction_model
##------------------------------------------
## --- //// ----
# Takes days vs log10 (case) linear model parameters and a set of days since 100 cases and outputs a dataframe with total number of predicted cases for those days
## --- //// ----
prediction_model<-function(m=1,b=0,days=1){
  total_cases.log<-m*days+b        # the model predicts log10(cases)
  total_cases<-10^total_cases.log  # back-transform to a case count
  prediction<-data.frame(days=days,Total_confirmed_cases_perstate=total_cases)
  return(prediction)
}
##------------------------------------------
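As a quick usage sketch (the slope and intercept below are illustrative placeholders, not fitted values), the intended behavior is log10(cases) = m·days + b, back-transformed to a case count:

```r
# Standalone copy of the prediction logic: the log-linear model predicts
# log10(cases), so back-transform with 10^ to get a case count.
prediction_model <- function(m = 1, b = 0, days = 1) {
  total_cases <- 10^(m * days + b)
  data.frame(days = days, Total_confirmed_cases_perstate = total_cases)
}

# Illustrative parameters only: baseline of 100 cases (log10 = 2), slope 0.03
pred <- prediction_model(m = 0.03, b = 2, days = c(0, 10, 20))
round(pred$Total_confirmed_cases_perstate, 1)  # 100.0 199.5 398.1
```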

##------------------------------------------
## FUNCTION: make_long
##------------------------------------------
## --- //// ----
# Takes wide-format case data and converts it into long format, using date and total cases as variable/value. Also enforces standardization/assumes data structure naming by using a fixed variable name, value name, and id.vars.
## --- //// ----
make_long<-function(data_in,variable.name = "Date",
                   value.name = "Total_confirmed_cases",
                   id.vars=c("case_type","Province.State","Country.Region","Lat","Long","City","Population")){

long_data<-melt(data_in,
                id.vars = id.vars,
                variable.name=variable.name,
                value.name=value.name)
return(long_data)

}
##------------------------------------------
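To see what make_long does, here is a toy wide-format frame in the JHU layout (one column per date) melted the same way; the region and values are made up:

```r
library(reshape2)

# Two date columns ("X1.22.20", "X1.23.20") mimic how read.csv names them.
toy <- data.frame(case_type = "total", Province.State = NA, Country.Region = "X",
                  Lat = 0, Long = 0, City = NA, Population = NA,
                  X1.22.20 = 1, X1.23.20 = 3)
long <- melt(toy,
             id.vars = c("case_type", "Province.State", "Country.Region",
                         "Lat", "Long", "City", "Population"),
             variable.name = "Date",
             value.name = "Total_confirmed_cases")
long$Total_confirmed_cases  # one row per date column
```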

## THIS WILL BE IN UTILS AT SOME POINT
name_overlaps<-function(df1,df2){
i<-intersect(names(df1),
names(df2))
sd1<-setdiff(names(df1),
names(df2))
sd2<-setdiff(names(df2),names(df1))
cat("intersection:\n",paste(i,"\n"))
cat("in df1 but not df2:\n",paste(sd1,"\n"))
cat("in df2 but not df1:\n",paste(sd2,"\n"))
return(list("int"=i,"sd_1_2"=sd1,"sd_2_1"=sd2))
}

##------------------------------------------
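name_overlaps is a thin wrapper around the base set operations on names(); a toy example of the underlying idiom:

```r
# Two frames sharing one column name
df1 <- data.frame(a = 1, b = 2)
df2 <- data.frame(b = 3, c = 4)
intersect(names(df1), names(df2))  # "b"
setdiff(names(df1), names(df2))    # "a"  (in df1 but not df2)
setdiff(names(df2), names(df1))    # "c"  (in df2 but not df1)
```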

##------------------------------------------
## FUNCTION: find_linear_index
##------------------------------------------
## --- //// ----
# Find date at which total case data is linear (for a given data frame) 
## --- //// ----

find_linear_index<-function(tmp,running_avg=5){
  tmp$Total_confirmed_cases_perstate.log<-log(tmp$Total_confirmed_cases_perstate,2)
  derivative<-data.frame(matrix(nrow = nrow(tmp),ncol = 4))
  names(derivative)<-c("m.time","mm.time","cumsum","date")
  
  # First derivative
  for(t in 2:nrow(tmp)){
    slope.t<- tmp[t,"Total_confirmed_cases_perstate.log"]- tmp[t-1,"Total_confirmed_cases_perstate.log"]
    derivative[t,"m.time"]<-slope.t
    derivative[t,"date"]<-as.Date(tmp[t,"Date"])
  }
  
  # Second derivative
  for(t in 2:nrow(derivative)){
    slope.t<- derivative[t,"m.time"]- derivative[t-1,"m.time"]
    derivative[t,"mm.time"]<-slope.t
  }
  
  #Count how many of the last `running_avg` second-derivative values are within 0.2 of zero
  for(t in running_avg:nrow(derivative)){
    slope.t<- sum(abs(derivative[(t-running_avg+1):t,"mm.time"])<0.2,na.rm = T)
    derivative[t,"cumsum"]<-slope.t
  }
  
  #Find the date running_avg days before the stability point
  linear_begin<-min(derivative[!is.na(derivative$cumsum) & derivative$cumsum==running_avg,"date"])-running_avg
  
  return(linear_begin)
}
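The detection idea behind find_linear_index can be sketched with diff() on a synthetic series (the numbers below are illustrative): during clean doubling, log2(cases) is linear, so the discrete second derivative sits near zero.

```r
# Erratic early counts, then clean doubling from 100 cases
cases <- c(2, 5, 6, 14, 15, 100 * 2^(0:9))
d1 <- diff(log2(cases))  # first difference: daily growth rate in log2 space
d2 <- diff(d1)           # second difference: change in that growth rate
# find_linear_index's criterion: |second derivative| < 0.2 for 5 straight days
abs(d2) < 0.2            # FALSE during the erratic start, TRUE once doubling begins
```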

READ IN DATA

# Q: do we want to archive previous versions? Maybe an auto git mv?

##------------------------------------------
## Download and read in latest data from github
##------------------------------------------
download.file(Corona_Cases.source_url, destfile = Corona_Cases.fn)
Corona_Totals.raw <- read.csv(Corona_Cases.fn, header = T, stringsAsFactors = F)

download.file(Corona_Cases.US.source_url, destfile = Corona_Cases.US.fn)
Corona_Totals.US.raw <- read.csv(Corona_Cases.US.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.source_url, destfile = Corona_Deaths.fn)
Corona_Deaths.raw <- read.csv(Corona_Deaths.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.US.source_url, destfile = Corona_Deaths.US.fn)
Corona_Deaths.US.raw <- read.csv(Corona_Deaths.US.fn, header = T, stringsAsFactors = F)

# latest date on all data:
paste("US deaths:", names(Corona_Deaths.US.raw)[ncol(Corona_Deaths.US.raw)])
## [1] "US deaths: X6.22.20"
paste("US total:", names(Corona_Totals.US.raw)[ncol(Corona_Totals.US.raw)])
## [1] "US total: X6.22.20"
paste("World deaths:", names(Corona_Deaths.raw)[ncol(Corona_Deaths.raw)])
## [1] "World deaths: X6.22.20"
paste("World total:", names(Corona_Totals.raw)[ncol(Corona_Totals.raw)])
## [1] "World total: X6.22.20"

PROCESS DATA

  • Convert to long format
  • Fix date formatting/convert to numeric date
  • Log10 transform total # cases
##------------------------------------------
## Combine death and total data frames
##------------------------------------------
Corona_Totals.raw$case_type<-"total"
Corona_Totals.US.raw$case_type<-"total"
Corona_Deaths.raw$case_type<-"death"
Corona_Deaths.US.raw$case_type<-"death"

# For some reason, Population is listed in the US deaths file but not in the other data... Weird. When combining, all datasets will have this column, but US deaths is the only one with useful values.
Corona_Totals.US.raw$Population<-"NA" 
Corona_Totals.raw$Population<-"NA"
Corona_Deaths.raw$Population<-"NA"

Corona_Cases.raw<-rbind(Corona_Totals.raw,Corona_Deaths.raw)
Corona_Cases.US.raw<-rbind(Corona_Totals.US.raw,Corona_Deaths.US.raw)
#TODO: custom utils- setdiff, intersect names... option to output in merging too
##------------------------------------------
# prepare raw datasets for eventual combining
##------------------------------------------
Corona_Cases.raw$City<-"NA" # world-level data lacks City; US-level data has cities
Corona_Cases.US.raw$Country_Region<-"US_state" # To differentiate from World-level stats

Corona_Cases.US.raw<-plyr::rename(Corona_Cases.US.raw,c("Province_State"="Province.State",
                                                  "Country_Region"="Country.Region",
                                                  "Long_"="Long",
                                                  "Admin2"="City"))


##------------------------------------------
## Convert to long format
##------------------------------------------
#JHU uses an awkward file format: wide, with one column per date in MM/DD/YY. So read this in as raw data but transform it to be better suited for analysis.
# Furthermore, the world- and US-level data are formatted differently, containing different columns, etc. Rectify this and combine the world-level stats with the U.S.-level stats.

Corona_Cases.long<-rbind(make_long(select(Corona_Cases.US.raw,-c(UID,iso2,iso3,code3,FIPS,Combined_Key))),
make_long(Corona_Cases.raw))


##------------------------------------------
## Fix date formatting, convert to numeric date
##------------------------------------------
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "^X",replacement = "0") # read.csv prefixes digit-leading column names with "X"
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "20$",replacement = "2020") # year ends in .20, not .2020
Corona_Cases.long$Date<-as.Date(Corona_Cases.long$Date,format = "%m.%d.%y")
Corona_Cases.long$Date.numeric<-as.numeric(Corona_Cases.long$Date)
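A single-header walkthrough of the cleanup above (read.csv turns a header like 1/22/20 into "X1.22.20" because column names can't start with a digit):

```r
d <- "X1.22.20"
d <- gsub("^X", "0", d)            # "01.22.20"
d <- gsub("20$", "2020", d)        # "01.22.2020"
as.Date(d, format = "%m.%d.%y")    # "2020-01-22"
```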

kable(table(select(Corona_Cases.long,c("Country.Region","case_type"))),caption = "Number of death and total case longitudinal datapoints per geographical region")
Number of death and total case longitudinal datapoints per geographical region
death total
Afghanistan 153 153
Albania 153 153
Algeria 153 153
Andorra 153 153
Angola 153 153
Antigua and Barbuda 153 153
Argentina 153 153
Armenia 153 153
Australia 1224 1224
Austria 153 153
Azerbaijan 153 153
Bahamas 153 153
Bahrain 153 153
Bangladesh 153 153
Barbados 153 153
Belarus 153 153
Belgium 153 153
Belize 153 153
Benin 153 153
Bhutan 153 153
Bolivia 153 153
Bosnia and Herzegovina 153 153
Botswana 153 153
Brazil 153 153
Brunei 153 153
Bulgaria 153 153
Burkina Faso 153 153
Burma 153 153
Burundi 153 153
Cabo Verde 153 153
Cambodia 153 153
Cameroon 153 153
Canada 2142 2142
Central African Republic 153 153
Chad 153 153
Chile 153 153
China 5049 5049
Colombia 153 153
Comoros 153 153
Congo (Brazzaville) 153 153
Congo (Kinshasa) 153 153
Costa Rica 153 153
Cote d’Ivoire 153 153
Croatia 153 153
Cuba 153 153
Cyprus 153 153
Czechia 153 153
Denmark 459 459
Diamond Princess 153 153
Djibouti 153 153
Dominica 153 153
Dominican Republic 153 153
Ecuador 153 153
Egypt 153 153
El Salvador 153 153
Equatorial Guinea 153 153
Eritrea 153 153
Estonia 153 153
Eswatini 153 153
Ethiopia 153 153
Fiji 153 153
Finland 153 153
France 1683 1683
Gabon 153 153
Gambia 153 153
Georgia 153 153
Germany 153 153
Ghana 153 153
Greece 153 153
Grenada 153 153
Guatemala 153 153
Guinea 153 153
Guinea-Bissau 153 153
Guyana 153 153
Haiti 153 153
Holy See 153 153
Honduras 153 153
Hungary 153 153
Iceland 153 153
India 153 153
Indonesia 153 153
Iran 153 153
Iraq 153 153
Ireland 153 153
Israel 153 153
Italy 153 153
Jamaica 153 153
Japan 153 153
Jordan 153 153
Kazakhstan 153 153
Kenya 153 153
Korea, South 153 153
Kosovo 153 153
Kuwait 153 153
Kyrgyzstan 153 153
Laos 153 153
Latvia 153 153
Lebanon 153 153
Lesotho 153 153
Liberia 153 153
Libya 153 153
Liechtenstein 153 153
Lithuania 153 153
Luxembourg 153 153
Madagascar 153 153
Malawi 153 153
Malaysia 153 153
Maldives 153 153
Mali 153 153
Malta 153 153
Mauritania 153 153
Mauritius 153 153
Mexico 153 153
Moldova 153 153
Monaco 153 153
Mongolia 153 153
Montenegro 153 153
Morocco 153 153
Mozambique 153 153
MS Zaandam 153 153
Namibia 153 153
Nepal 153 153
Netherlands 765 765
New Zealand 153 153
Nicaragua 153 153
Niger 153 153
Nigeria 153 153
North Macedonia 153 153
Norway 153 153
Oman 153 153
Pakistan 153 153
Panama 153 153
Papua New Guinea 153 153
Paraguay 153 153
Peru 153 153
Philippines 153 153
Poland 153 153
Portugal 153 153
Qatar 153 153
Romania 153 153
Russia 153 153
Rwanda 153 153
Saint Kitts and Nevis 153 153
Saint Lucia 153 153
Saint Vincent and the Grenadines 153 153
San Marino 153 153
Sao Tome and Principe 153 153
Saudi Arabia 153 153
Senegal 153 153
Serbia 153 153
Seychelles 153 153
Sierra Leone 153 153
Singapore 153 153
Slovakia 153 153
Slovenia 153 153
Somalia 153 153
South Africa 153 153
South Sudan 153 153
Spain 153 153
Sri Lanka 153 153
Sudan 153 153
Suriname 153 153
Sweden 153 153
Switzerland 153 153
Syria 153 153
Taiwan* 153 153
Tajikistan 153 153
Tanzania 153 153
Thailand 153 153
Timor-Leste 153 153
Togo 153 153
Trinidad and Tobago 153 153
Tunisia 153 153
Turkey 153 153
Uganda 153 153
Ukraine 153 153
United Arab Emirates 153 153
United Kingdom 1683 1683
Uruguay 153 153
US 153 153
US_state 498933 498933
Uzbekistan 153 153
Venezuela 153 153
Vietnam 153 153
West Bank and Gaza 153 153
Western Sahara 153 153
Yemen 153 153
Zambia 153 153
Zimbabwe 153 153
# Decouple population and lat/long data, refactor to make it more tidy
metadata_columns<-c("Lat","Long","Population")
metadata<-unique(select(filter(Corona_Cases.long,case_type=="death"),c("Country.Region","Province.State","City",all_of(metadata_columns))))
Corona_Cases.long<-select(Corona_Cases.long,-all_of(metadata_columns))

# Some counties are not summarized on the country level. collapse all but US
Corona_Cases.long<-rbind.fill(ddply(filter(Corona_Cases.long,!Country.Region=="US_state"),c("case_type","Country.Region","Date","Date.numeric"),summarise,Total_confirmed_cases=sum(Total_confirmed_cases)),filter(Corona_Cases.long,Country.Region=="US_state"))

# Put total case and deaths side-by-side (wide)
Corona_Cases<-spread(Corona_Cases.long,key = case_type,value = Total_confirmed_cases)

#Compute mortality rate
Corona_Cases$mortality_rate<-Corona_Cases$death/Corona_Cases$total
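The pivot-and-ratio step on a toy frame (the counts are made up for illustration):

```r
library(tidyr)

toy_long <- data.frame(Country.Region = "X", Date = as.Date("2020-04-01"),
                       case_type = c("total", "death"),
                       Total_confirmed_cases = c(1000, 30))
# spread puts total cases and deaths side-by-side as columns
toy_wide <- spread(toy_long, key = case_type, value = Total_confirmed_cases)
toy_wide$death / toy_wide$total  # mortality rate: 0.03
```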

#TMP
Corona_Cases<-plyr::rename(Corona_Cases,c("total"="Total_confirmed_cases","death"="Total_confirmed_deaths"))

##------------------------------------------
## log10 transform total # cases
##------------------------------------------
Corona_Cases$Total_confirmed_cases.log<-log(Corona_Cases$Total_confirmed_cases,10)
Corona_Cases$Total_confirmed_deaths.log<-log(Corona_Cases$Total_confirmed_deaths,10)
##------------------------------------------
       
##------------------------------------------
## Compute # of days since 100th for US data
##------------------------------------------

# Find the day the 100th case was reported for each Country/Province. NOTE: non-US countries may have odd provinces. For example, France is summarized at the country level but also has 3 provinces. I've only ensured the U.S. case100 works... so the case100_date for the U.S. is summarized both for the entire country (regardless of state) and on a per-state level.
# TODO: consider city-level summary as well. This data may be sparse

Corona_Cases<-merge(Corona_Cases,ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,case100_date=min(Date.numeric)))
Corona_Cases$Days_since_100<-Corona_Cases$Date.numeric-Corona_Cases$case100_date
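The same pattern on a toy frame (toy values only): find the first numeric date with more than 100 cases per region, merge it back, and difference.

```r
library(plyr)

toy <- data.frame(Country.Region = "X",
                  Date.numeric = 1:4,
                  Total_confirmed_cases = c(50, 150, 300, 700))
# First date past 100 cases, per region
first100 <- ddply(subset(toy, Total_confirmed_cases > 100),
                  "Country.Region", summarise,
                  case100_date = min(Date.numeric))
toy <- merge(toy, first100)
toy <- toy[order(toy$Date.numeric), ]
toy$Days_since_100 <- toy$Date.numeric - toy$case100_date
toy$Days_since_100  # -1 0 1 2
```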

##------------------------------------------
## Add population and lat/long data (CURRENTLY US ONLY)
##------------------------------------------

kable(filter(metadata,(is.na(Country.Region) | is.na(Population) )) %>% select(c("Country.Region","Province.State","City")) %>% unique(),caption = "Regions for which either population or Country is NA")
Regions for which either population or Country is NA
Country.Region Province.State City
# Drop missing data 
metadata<-filter(metadata,!(is.na(Country.Region) | is.na(Population) ))
# Convert remaining pop to numeric
metadata$Population<-as.numeric(metadata$Population)
## Warning: NAs introduced by coercion
# Add metadata to cases
Corona_Cases<-merge(Corona_Cases,metadata,all.x = T)

##------------------------------------------
## Compute total and death cases relative to population 
##------------------------------------------

Corona_Cases$Total_confirmed_cases.per100<-100*Corona_Cases$Total_confirmed_cases/Corona_Cases$Population
Corona_Cases$Total_confirmed_deaths.per100<-100*Corona_Cases$Total_confirmed_deaths/Corona_Cases$Population


##------------------------------------------
## Filter df for US state-wide stats
##------------------------------------------

Corona_Cases.US_state<-filter(Corona_Cases,Country.Region=="US_state" & Total_confirmed_cases>0 )
Corona_Cases.US_state[!is.na(Corona_Cases.US_state$Province.State) & Corona_Cases.US_state$Province.State=="Pennsylvania" & Corona_Cases.US_state$City=="Delaware","City"]<-"Delaware_PA"

kable(table(select(Corona_Cases.US_state,c("Province.State"))),caption = "Number of longitudinal datapoints (total/death) per state")
Number of longitudinal datapoints (total/death) per state
Var1 Freq
Alabama 6097
Alaska 1252
Arizona 1481
Arkansas 6480
California 5643
Colorado 5507
Connecticut 880
Delaware 371
Diamond Princess 98
District of Columbia 99
Florida 6402
Georgia 14518
Grand Princess 99
Guam 99
Hawaii 507
Idaho 2971
Illinois 8587
Indiana 8401
Iowa 8080
Kansas 6878
Kentucky 9706
Louisiana 6035
Maine 1542
Maryland 2321
Massachusetts 1464
Michigan 7508
Minnesota 7174
Mississippi 7492
Missouri 8724
Montana 2675
Nebraska 5331
Nevada 1206
New Hampshire 1049
New Jersey 2213
New Mexico 2644
New York 5657
North Carolina 8936
North Dakota 3301
Northern Mariana Islands 84
Ohio 7941
Oklahoma 6199
Oregon 3059
Pennsylvania 6209
Puerto Rico 99
Rhode Island 586
South Carolina 4342
South Dakota 4180
Tennessee 8629
Texas 18913
Utah 1344
Vermont 1410
Virgin Islands 99
Virginia 11373
Washington 3853
West Virginia 4288
Wisconsin 6146
Wyoming 1918
Corona_Cases.US_state<-merge(Corona_Cases.US_state,ddply(filter(Corona_Cases.US_state,Total_confirmed_cases>100),c("Province.State"),summarise,case100_date_state=min(Date.numeric)))
Corona_Cases.US_state$Days_since_100_state<-Corona_Cases.US_state$Date.numeric-Corona_Cases.US_state$case100_date_state

ANALYSIS

Q1: What is the trend in cases and mortality across geographical regions?

Plot # of cases vs time
* For each geographical set:
* comparative longitudinal case trend (absolute & log scale)
* comparative longitudinal mortality trend
* death vs total correlation

question                                   dataset  x      y               color      facet              pch        dimensions
comparative longitudinal case trend (log)  long     time   log_cases       geography  none (case type?)  case_type  [15, 50, 4] geography x (2 scale?) case type
comparative longitudinal case trend        long     time   cases           geography  case_type          ?          [15, 50, 4] geography x (2+ scale) case type
comparative longitudinal mortality trend   wide     time   mortality rate  geography  none               none       [15, 50, 4] geography
death vs total correlation                 wide     cases  deaths          geography  none               none       [15, 50, 4] geography
# total cases vs time
# death cases vs time
# mortality rate vs time
# death vs mortality


  # death vs mortality
  # total & death case vs time (same plot)

#<question> <x> <y> <colored> <facet> <dataset>
## trend in cases/deaths over time, compared across regions <time> <log cases> <geography*> <none> <.wide>
## trend in cases/deaths over time, compared across regions <time> <cases> <geography*> <case_type> <.long>
## trend in mortality rate over time, compared across regions <time> <mortality rate> <geography*> <none>
## how are death/mortality related/correlated? <time> <log cases> <geography*> <none>
## how are death and case load correlated? <cases> <deaths>

# lm for each?? - > apply lm from each region starting from 100th case. m, b associated with each.
    # input: geographical region, logcase vs day (100th case)
    # output: m, b for each geographical region ID



#total/death on the same plot differ by ~2 logs, so when plotting log, use pch; when plotting absolute, need to use free scales
#when plotting death and case on same, melt. 

#CoronaCases - > filter sets (3)
  #world - choose countries with sufficient data

N<-ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,n=length(Country.Region))
ggplot(filter(N,n<100),aes(x=n))+
  geom_histogram()+
  default_theme+
  ggtitle("Distribution of number of days with at least 100 confirmed cases for each region")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

kable(arrange(N,-n),caption="Sorted number of days with at least 100 confirmed cases")
Sorted number of days with at least 100 confirmed cases
Country.Region n
US_state 70275
China 153
Diamond Princess 134
Korea, South 124
Japan 123
Italy 121
Iran 118
Singapore 115
France 114
Germany 114
Spain 113
US 111
Switzerland 110
United Kingdom 110
Belgium 109
Netherlands 109
Norway 109
Sweden 109
Austria 107
Malaysia 106
Australia 105
Bahrain 105
Denmark 105
Canada 104
Qatar 104
Iceland 103
Brazil 102
Czechia 102
Finland 102
Greece 102
Iraq 102
Israel 102
Portugal 102
Slovenia 102
Egypt 101
Estonia 101
India 101
Ireland 101
Kuwait 101
Philippines 101
Poland 101
Romania 101
Saudi Arabia 101
Indonesia 100
Lebanon 100
Pakistan 100
San Marino 100
Thailand 100
Chile 99
Luxembourg 98
Peru 98
Russia 98
Ecuador 97
Mexico 97
Slovakia 97
South Africa 97
United Arab Emirates 97
Armenia 96
Colombia 96
Croatia 96
Panama 96
Serbia 96
Taiwan* 96
Turkey 96
Argentina 95
Bulgaria 95
Latvia 95
Uruguay 95
Algeria 94
Costa Rica 94
Dominican Republic 94
Hungary 94
Andorra 93
Bosnia and Herzegovina 93
Jordan 93
Lithuania 93
Morocco 93
New Zealand 93
North Macedonia 93
Vietnam 93
Albania 92
Cyprus 92
Malta 92
Moldova 92
Brunei 91
Burkina Faso 91
Sri Lanka 91
Tunisia 91
Ukraine 90
Azerbaijan 89
Ghana 89
Kazakhstan 89
Oman 89
Senegal 89
Venezuela 89
Afghanistan 88
Cote d’Ivoire 88
Cuba 87
Mauritius 87
Uzbekistan 87
Cambodia 86
Cameroon 86
Honduras 86
Nigeria 86
West Bank and Gaza 86
Belarus 85
Georgia 85
Bolivia 84
Kosovo 84
Kyrgyzstan 84
Montenegro 84
Congo (Kinshasa) 83
Kenya 82
Niger 81
Guinea 80
Rwanda 80
Trinidad and Tobago 80
Paraguay 79
Bangladesh 78
Djibouti 76
El Salvador 75
Guatemala 74
Madagascar 73
Mali 72
Congo (Brazzaville) 69
Jamaica 69
Gabon 67
Somalia 67
Tanzania 67
Ethiopia 66
Burma 65
Sudan 64
Liberia 63
Maldives 61
Equatorial Guinea 60
Cabo Verde 58
Sierra Leone 56
Guinea-Bissau 55
Togo 55
Zambia 54
Eswatini 53
Chad 52
Tajikistan 51
Haiti 49
Sao Tome and Principe 49
Benin 47
Nepal 47
Uganda 47
Central African Republic 46
South Sudan 46
Guyana 44
Mozambique 43
Yemen 39
Mongolia 38
Mauritania 35
Nicaragua 35
Malawi 29
Syria 29
Zimbabwe 27
Bahamas 26
Libya 26
Comoros 24
Suriname 16
Angola 13
Eritrea 8
Burundi 7
Monaco 1
# Pick the top countries with data (capped at the color-palette size)
max_colors<-12
# find a way to fix this- China has different provinces. Plot doesn't look right...
sufficient_data<-arrange(filter(N,!Country.Region %in% c("US_state", "Diamond Princess")),-n)[1:max_colors,]
kable(sufficient_data,caption = paste0("Top ",max_colors," countries with sufficient data"))
Top 12 countries with sufficient data
Country.Region n
China 153
Korea, South 124
Japan 123
Italy 121
Iran 118
Singapore 115
France 114
Germany 114
Spain 113
US 111
Switzerland 110
United Kingdom 110
Corona_Cases.world<-filter(Corona_Cases,Country.Region %in% c(sufficient_data$Country.Region))


  #us 
  #    - by state
Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
# summarize 
#!City %in% c("Unassigned") 
  #    - specific cities
#mortality_rate!=Inf & mortality_rate<=1
head(Corona_Cases)
##   Country.Region Province.State City       Date Date.numeric
## 1    Afghanistan           <NA> <NA> 2020-03-07        18328
## 2    Afghanistan           <NA> <NA> 2020-03-04        18325
## 3    Afghanistan           <NA> <NA> 2020-03-05        18326
## 4    Afghanistan           <NA> <NA> 2020-03-09        18330
## 5    Afghanistan           <NA> <NA> 2020-03-03        18324
## 6    Afghanistan           <NA> <NA> 2020-04-12        18364
##   Total_confirmed_deaths Total_confirmed_cases mortality_rate
## 1                      0                     1     0.00000000
## 2                      0                     1     0.00000000
## 3                      0                     1     0.00000000
## 4                      0                     4     0.00000000
## 5                      0                     1     0.00000000
## 6                     18                   607     0.02965404
##   Total_confirmed_cases.log Total_confirmed_deaths.log case100_date
## 1                  0.000000                       -Inf        18348
## 2                  0.000000                       -Inf        18348
## 3                  0.000000                       -Inf        18348
## 4                  0.602060                       -Inf        18348
## 5                  0.000000                       -Inf        18348
## 6                  2.783189                   1.255273        18348
##   Days_since_100 Lat Long Population Total_confirmed_cases.per100
## 1            -20  NA   NA         NA                           NA
## 2            -23  NA   NA         NA                           NA
## 3            -22  NA   NA         NA                           NA
## 4            -18  NA   NA         NA                           NA
## 5            -24  NA   NA         NA                           NA
## 6             16  NA   NA         NA                           NA
##   Total_confirmed_deaths.per100
## 1                            NA
## 2                            NA
## 3                            NA
## 4                            NA
## 5                            NA
## 6                            NA
Corona_Cases[!is.na(Corona_Cases$Province.State) & Corona_Cases$Province.State=="Pennsylvania" & Corona_Cases$City=="Delaware","City"]<-"Delaware_PA"


Corona_Cases.UScity<-filter(Corona_Cases,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") & City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA"))


measure_vars_long<-c("Total_confirmed_cases.log","Total_confirmed_cases","Total_confirmed_deaths","Total_confirmed_deaths.log")
melt_arg_list<-list(variable.name = "case_type",value.name = "cases",measure.vars = c("Total_confirmed_cases","Total_confirmed_deaths"))
melt_arg_list$data=NULL


melt_arg_list$data=select(Corona_Cases.world,-ends_with(match = "log"))
Corona_Cases.world.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.UScity,-ends_with(match = "log"))
Corona_Cases.UScity.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.US_state,-ends_with(match = "log"))
Corona_Cases.US_state.long<-do.call(melt,melt_arg_list)

Corona_Cases.world.long$cases.log<-log(Corona_Cases.world.long$cases,10)
Corona_Cases.US_state.long$cases.log<-log(Corona_Cases.US_state.long$cases,10)
Corona_Cases.UScity.long$cases.log<-log(Corona_Cases.UScity.long$cases,10)


# what is the current death and total case load for US? For world? For states?
#-absolute
#-log

# what is mortality rate (US, world)
#-absolute

#how are deaths and cases correlated? (US, world)
#-absolute
#Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
#Corona_Cases.US.case100<-filter(Corona_Cases.US, Days_since_100>=0)
# linear model parameters
#(model_fit<-lm(formula = Total_confirmed_cases.log~Days_since_100,data= Corona_Cases.US.case100 ))

#(slope<-model_fit$coefficients[2])
#(intercept<-model_fit$coefficients[1])

# Correlation coefficient
#cor(x = Corona_Cases.US.case100$Days_since_100,y = Corona_Cases.US.case100$Total_confirmed_cases.log)

##------------------------------------------
## Plot World Data
##------------------------------------------
# Timestamp for world
timestamp_plot.world<-paste("Most recent date for which data available:",max(Corona_Cases.world$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")


# Base template for plots
baseplot.world<-ggplot(data=NULL,aes(x=Days_since_100,col=Country.Region))+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))


##/////////////////////////
### Plot Longitudinal cases

(Corona_Cases.world.long.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world)
    )

(Corona_Cases.world.loglong.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases.log))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases.log))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world))

##/////////////////////////
### Plot Longitudinal mortality rate

(Corona_Cases.world.mortality.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world,aes(y=mortality_rate))+
    geom_line(data=Corona_Cases.world,aes(y=mortality_rate))+
    ylim(c(0,0.3))+
    ggtitle(timestamp_plot.world))
## Warning: Removed 100 rows containing missing values (geom_point).
## Warning: Removed 100 row(s) containing missing values (geom_path).

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.world.casecor.plot<-ggplot(Corona_Cases.world,aes(x=Total_confirmed_cases,y=Total_confirmed_deaths,col=Country.Region))+
  geom_point()+
  geom_line()+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
    ggtitle(timestamp_plot.world))

### Write plots

write_plot(Corona_Cases.world.long.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.long.plot.png"
write_plot(Corona_Cases.world.loglong.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.loglong.plot.png"
write_plot(Corona_Cases.world.mortality.plot,wd = results_dir)
## Warning: Removed 100 rows containing missing values (geom_point).

## Warning: Removed 100 row(s) containing missing values (geom_path).
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.mortality.plot.png"
write_plot(Corona_Cases.world.casecor.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.casecor.plot.png"
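`write_plot` is a helper defined upstream in this markdown; for readers reproducing this section in isolation, a minimal sketch consistent with the printed output paths (hypothetical; it assumes a `results_dir`-style directory ending in a slash and that `ggplot2` is loaded) might be:

```r
# Hypothetical sketch of the write_plot() helper used throughout this section.
# Saves a ggplot object as <wd><object name>.png and returns the path.
# (Assumption: the real helper is defined upstream in this document.)
write_plot <- function(plot_obj, wd) {
  out_path <- paste0(wd, deparse(substitute(plot_obj)), ".png")
  ggplot2::ggsave(filename = out_path, plot = plot_obj)
  out_path
}
```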
##------------------------------------------
## Plot US State Data
##-----------------------------------------

baseplot.US<-ggplot(data=NULL,aes(x=Days_since_100_state,col=case_type))+
  default_theme+
  facet_wrap(~Province.State)+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))

Corona_Cases.US_state.long.plot<-baseplot.US+geom_point(data=Corona_Cases.US_state.long,aes(y=cases.log))
##------------------------------------------
## Plot US City Data
##-----------------------------------------

Corona_Cases.US.plotdata<-filter(Corona_Cases.US_state,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") &
                                   City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA") &
                                   Total_confirmed_cases>0) 
timestamp_plot<-paste("Most recent date for which data available:",max(Corona_Cases.US.plotdata$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")

city_colors<-c("Bucks"='#beaed4',"Baltimore City"='#386cb0', "New York"='#7fc97f',"Burlington"='#fdc086',"Cape May"="#e78ac3","Delaware_PA"="#b15928")

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.city.loglong.plot<-ggplot(melt(Corona_Cases.US.plotdata,measure.vars = c("Total_confirmed_cases.log","Total_confirmed_deaths.log"),variable.name = "case_type",value.name = "cases"),aes(x=Date,y=cases,col=City,pch=case_type))+
  geom_point(size=4)+
    geom_line()+
  default_theme+
  #facet_wrap(~case_type)+
    ggtitle(paste("Log10 total and death cases over time,",timestamp_plot))+
theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
    scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.long.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State,scales = "free_y")+
  ggtitle(paste("MD, PA, NJ total cases over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.mortality.plot<-ggplot(Corona_Cases.US.plotdata,aes(x=Date,y=mortality_rate,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Mortality rate (deaths/total) over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.casecor.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(y=Total_confirmed_deaths,x=Total_confirmed_cases,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Correlation of death vs total cases,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
  scale_color_manual(values = city_colors))

(Corona_Cases.city.long.normalized.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases.per100,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State)+
  ggtitle(paste("MD, PA, NJ total cases over time per 100 people,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)  +
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

write_plot(Corona_Cases.city.long.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.plot.png"
write_plot(Corona_Cases.city.loglong.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.loglong.plot.png"
write_plot(Corona_Cases.city.mortality.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.mortality.plot.png"
write_plot(Corona_Cases.city.casecor.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.casecor.plot.png"
write_plot(Corona_Cases.city.long.normalized.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.normalized.plot.png"

Q1b: What is the model?

Fit the cases to a linear model:

  1. Find the time at which cases vs. date becomes linear in each plot
  2. Fit a linear model for each state

# What is the predict # of cases for the next few days?
# How is the model performing historically?

Corona_Cases.US_state.summary<-ddply(Corona_Cases.US_state,
                                     c("Province.State","Date"),
                                     summarise,
                                     Total_confirmed_cases_perstate=sum(Total_confirmed_cases)) %>% 
    filter(Total_confirmed_cases_perstate>100)

# Compute the states with the most cases (for coloring and for linear model)
top_states_totals<-head(ddply(Corona_Cases.US_state.summary,c("Province.State"),summarise, Total_confirmed_cases_perstate.max=max(Total_confirmed_cases_perstate)) %>% arrange(-Total_confirmed_cases_perstate.max),n=max_colors)

kable(top_states_totals,caption = "Top 12 States, total count ")
top_states<-top_states_totals$Province.State

# Manually adjust the state list: drop New York and add Maryland
top_states_modified<-c(top_states[top_states !="New York"],"Maryland")

# Plot with all states:
(Corona_Cases.US_state.summary.plot<-ggplot(Corona_Cases.US_state.summary,aes(x=Date,y=Total_confirmed_cases_perstate))+
  geom_point()+
  geom_point(data=filter(Corona_Cases.US_state.summary,Province.State %in% top_states),aes(col=Province.State))+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  default_theme+
  theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
  ggtitle("Total confirmed cases per state, top 12 colored")+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Fit linear model to time vs total cases
##-----------------------------------------

# First, find the date at which each state's cases vs time becomes linear (2nd derivative is about 0)
li<-ddply(Corona_Cases.US_state.summary,c("Province.State"),find_linear_index)
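`find_linear_index` is defined upstream in this document; as a sketch of the idea described in the comment above (hypothetical; the real threshold and return convention live in the upstream definition), it could look like:

```r
# Hypothetical sketch of find_linear_index(): given one state's data frame,
# return the earliest numeric date after which the discrete second difference
# of total cases stays near zero, i.e. growth is roughly linear.
# (Assumption: the real helper and its threshold are defined upstream.)
find_linear_index_sketch <- function(df, threshold = 50) {
  df <- df[order(df$Date), ]
  second_diff <- diff(diff(df$Total_confirmed_cases_perstate))
  idx <- which(abs(second_diff) < threshold)[1]
  if (is.na(idx)) idx <- 1  # fall back to the first date if never flat
  as.numeric(df$Date[idx + 1])
}
```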

# Compute linear model for each state starting at the point at which data becomes linear
for(i in 1:nrow(li)){
  Province.State.i<-li[i,"Province.State"]
  date.i<-li[i,"V1"]
  data.i<-filter(Corona_Cases.US_state.summary,Province.State==Province.State.i & as.numeric(Date) >= date.i)
  model_results<-lm(data.i,formula = Total_confirmed_cases_perstate~Date)
  slope<-model_results$coefficients[2]
  intercept<-model_results$coefficients[1]
  li[li$Province.State==Province.State.i,"m"]<-slope
  li[li$Province.State==Province.State.i,"b"]<-intercept
  }

# Compute top state case load with fitted model

(Corona_Cases.US_state.lm.plot<-ggplot(filter(Corona_Cases.US_state.summary,Province.State %in% top_states_modified ))+
    geom_abline(data=filter(li,Province.State %in% top_states_modified),
                aes(slope = m,intercept = b,col=Province.State),lty=2)+
    geom_point(aes(x=Date,y=Total_confirmed_cases_perstate,col=Province.State))+
    scale_color_brewer(type = "qualitative",palette = "Paired")+
    default_theme+
    theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
    ggtitle("Total confirmed cases per state, top 12 colored")+
    scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Predict the number of total cases over the next week
##-----------------------------------------

predicted_days<-c(0,1,2,3,7)+as.numeric(as.Date("2020-04-20"))

predicted_days_df<-data.frame(matrix(ncol=3))
names(predicted_days_df)<-c("Province.State","days","Total_confirmed_cases_perstate")
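
`prediction_model` is also defined upstream; consistent with how it is called in the loop below (slope `m` and intercept `b` from the per-state fit of cases vs. numeric date), a hypothetical sketch is:

```r
# Hypothetical sketch of prediction_model(): evaluate the fitted line
# m * day + b at each requested numeric day.
# (Assumption: the real helper is defined upstream in this document.)
prediction_model_sketch <- function(m, b, days) {
  data.frame(days = days,
             Total_confirmed_cases_perstate = m * days + b)
}
```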

# Use model parameters to estimate case loads
for(state.i in top_states_modified){
  predicted_days_df<-rbind(predicted_days_df,
                           data.frame(Province.State=state.i,
                                      prediction_model(m = li[li$Province.State==state.i,"m"],
                                                       b =li[li$Province.State==state.i,"b"] ,
                                                       days =predicted_days )))
  }

predicted_days_df$Date<-as.Date(predicted_days_df$days,origin="1970-01-01")

kable(predicted_days_df,caption = "Predicted total cases over the next week for selected states")

##------------------------------------------
## Write plots
##-----------------------------------------

write_plot(Corona_Cases.US_state.summary.plot,wd = results_dir)
write_plot(Corona_Cases.US_state.lm.plot,wd = results_dir)

##------------------------------------------
## Write tables
##-----------------------------------------

write.csv(predicted_days_df,file = paste0(results_dir,"predicted_total_cases_days.csv"),quote = F,row.names = F)

Q2: What is the predicted number of cases?

What is the predicted trajectory of COVID-19 based on the model so far? Additional questions:

Why did it take until day 40 for a log-linear trend to start? How long will it be until x number of cases? When will the plateau happen? Are any effects of social distancing noticeable? What about delays?

##------------------------------------------
## Prediction and Prediction Accuracy
##------------------------------------------


today_num<-max(Corona_Cases.US$Days_since_100)
predicted_days<-today_num+c(1,2,3,7)

#mods = dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)
#today:
Corona_Cases.US[Corona_Cases.US$Days_since_100==(today_num-1),]
Corona_Cases.US[Corona_Cases.US$Days_since_100==today_num,]
Corona_Cases.US$type<-"Historical"


#prediction_values<-prediction_model(m=slope,b=intercept,days = predicted_days)$Total_confirmed_cases

historical_model<-data.frame(date=today_num,m=slope,b=intercept)
# Toy data to sanity-check per-group model fitting
tmp<-data.frame(state=rep(c("A","B"),each=3),x=c(1,2,3,4,5,6))
tmp$y<-c(tmp[1:3,"x"]+5,tmp[4:6,"x"]*5+1)
lm(data =tmp,formula = y~x )

train_lm<-function(input_data,subset_column,formula_input){
  # Fit one lm per group defined by subset_column, then collect coefficients
  case_models <- dlply(input_data, subset_column, lm, formula = formula_input)
  case_models.parameters <- ldply(case_models, coef)
  # plyr::rename takes c(old_name = new_name); the slope column is named
  # after the predictor on the right-hand side of formula_input
  predictor <- all.vars(formula_input)[2]
  case_models.parameters<-rename(case_models.parameters,
                                 setNames(c("b","m"),c("(Intercept)",predictor)))
  return(case_models.parameters)
}

train_lm(tmp,"state",y~x)
 
# model for previous y days
#historical_model_predictions<-data.frame(day_x=NULL,Days_since_100=NULL,Total_confirmed_cases=NULL,Total_confirmed_cases.log=NULL)
# for(i in c(1,2,3,4,5,6,7,8,9,10)){
#   #i<-1
# day_x<-today_num-i # 1, 2, 3, 4
# day_x_nextweek<-day_x+c(1,2,3)
# model_fit_x<-lm(data = filter(Corona_Cases.US.case100,Days_since_100 < day_x),formula = Total_confirmed_cases.log~Days_since_100)
# prediction_day_x_nextweek<-prediction_model(m = model_fit_x$coefficients[2],b = model_fit_x$coefficients[1],days = day_x_nextweek)
# prediction_day_x_nextweek$type<-"Predicted"
# acutal_day_x_nextweek<-filter(Corona_Cases.US,Days_since_100 %in% day_x_nextweek) %>% select(c(Days_since_100,Total_confirmed_cases,Total_confirmed_cases.log))
# acutal_day_x_nextweek$type<-"Historical"
# historical_model_predictions.i<-data.frame(day_x=day_x,rbind(acutal_day_x_nextweek,prediction_day_x_nextweek))
# historical_model_predictions<-rbind(historical_model_predictions.i,historical_model_predictions)
# }

# Note: the commented blocks in this section must be uncommented and run
# to build historical_model_predictions.withHx before the plot below renders.
#historical_model_predictions.withHx<-rbind.fill(historical_model_predictions,data.frame(Corona_Cases.US,type="Historical"))
#historical_model_predictions.withHx$Total_confirmed_cases.log2<-log(historical_model_predictions.withHx$Total_confirmed_cases,2)

(historical_model_predictions.plot<-ggplot(historical_model_predictions.withHx,aes(x=Days_since_100,y=Total_confirmed_cases.log,col=type))+
    geom_point(size=3)+
    default_theme+
    theme(legend.position = "bottom")+ 
      #geom_abline(slope = slope,intercept =intercept,lty=2)+
    #facet_wrap(~case_type,ncol=1)+
    scale_color_manual(values = c("Historical"="#377eb8","Predicted"="#e41a1c")))
write_plot(historical_model_predictions.plot,wd=results_dir)

Q3: What is the effect of social distancing and decreased mobility on case load?

Load data from Google, which computes the % change in user mobility relative to baseline for:

* Recreation
* Workplace
* Residence
* Park
* Grocery

Data from https://www.google.com/covid19/mobility/

# See pre-processing section for script on gathering mobility data

# UNDER DEVELOPMENT

mobility<-read.csv("/Users/stevensmith/Projects/MIT_COVID19/mobility.csv",header = T,stringsAsFactors = F)
#mobility$Retail_Recreation<-as.numeric(sub(mobility$Retail_Recreation,pattern = "%",replacement = ""))
#mobility$Workplace<-as.numeric(sub(mobility$Workplace,pattern = "%",replacement = ""))
#mobility$Residential<-as.numeric(sub(mobility$Residential,pattern = "%",replacement = ""))

##------------------------------------------
## Show relationship between mobility and caseload
##------------------------------------------
mobility$County<-gsub(mobility$County,pattern = " County",replacement = "")
Corona_Cases.US_state.mobility<-merge(Corona_Cases.US_state,plyr::rename(mobility,c("State"="Province.State","County"="City")))

#Corona_Cases.US_state.tmp<-merge(metadata,Corona_Cases.US_state.tmp)
# Needs to happen upsteam, see todos
#Corona_Cases.US_state.tmp$Total_confirmed_cases.perperson<-Corona_Cases.US_state.tmp$Total_confirmed_cases/as.numeric(Corona_Cases.US_state.tmp$Population)
mobility_measures<-c("Retail_Recreation","Grocery_Pharmacy","Parks","Transit","Workplace","Residential")

plot_data<-filter(Corona_Cases.US_state.mobility, Date.numeric==max(Corona_Cases.US_state$Date.numeric) ) %>% melt(measure.vars=mobility_measures) 
plot_data$value<-as.numeric(gsub(plot_data$value,pattern = "%",replacement = ""))
plot_data<-filter(plot_data,!is.na(value))

(mobility.plot<-ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_grid(Province.State~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab("Confirmed cases per 100 people (today)")+
  default_theme+
  ggtitle("Mobility change vs cases"))

(mobility.global.plot<-ggplot(plot_data,aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_wrap(~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab("Confirmed cases (today) per 100 people")+
  default_theme+
  ggtitle("Mobility change vs cases"))

plot_data.permobility_summary<-ddply(plot_data,c("Province.State","variable"),summarise,cor=cor(y =Total_confirmed_cases.per100,x=value),median_change=median(x=value)) %>% arrange(-abs(cor))

kable(plot_data.permobility_summary,caption = "Ranked per-state mobility correlation with total confirmed cases")
Ranked per-state mobility correlation with total confirmed cases
Province.State variable cor median_change
Alaska Transit 1.0000000 -63.0
Delaware Retail_Recreation 1.0000000 -39.5
Delaware Grocery_Pharmacy 1.0000000 -17.5
Delaware Parks -1.0000000 20.5
Delaware Transit 1.0000000 -37.0
Delaware Workplace 1.0000000 -37.0
Delaware Residential -1.0000000 14.0
Hawaii Parks 0.9997631 -72.0
Hawaii Transit 0.9945906 -89.0
New Hampshire Parks 0.9500025 -20.0
Vermont Parks 0.9292450 -35.5
Maine Transit -0.9183879 -50.0
Connecticut Grocery_Pharmacy -0.8836749 -6.0
Utah Residential -0.8612118 12.0
Utah Transit -0.8536302 -18.0
Hawaii Grocery_Pharmacy 0.8514813 -34.0
South Dakota Parks 0.7952497 -26.0
Hawaii Retail_Recreation 0.7758604 -56.0
Arizona Grocery_Pharmacy -0.7703179 -15.0
Rhode Island Workplace -0.7691260 -39.5
Arizona Residential 0.7610960 13.0
Connecticut Transit -0.7520667 -50.0
Massachusetts Workplace -0.7507857 -39.0
Wyoming Parks -0.7452880 -4.0
Alaska Residential 0.7355127 13.0
Alaska Grocery_Pharmacy -0.7170269 -7.0
Nevada Transit -0.7029312 -20.0
Vermont Grocery_Pharmacy -0.6570931 -25.0
New York Workplace -0.6539532 -34.5
North Dakota Parks 0.6388217 -34.0
Idaho Residential -0.6364649 11.0
Rhode Island Retail_Recreation -0.6280705 -45.0
Alaska Workplace -0.6216855 -33.0
Utah Parks -0.6003977 17.0
New Jersey Parks -0.5992988 -6.0
Rhode Island Residential -0.5981837 18.5
New York Retail_Recreation -0.5906486 -46.0
Maine Workplace -0.5834345 -30.0
Nebraska Workplace 0.5808514 -32.0
Arkansas Parks 0.5565848 -12.0
Utah Workplace -0.5345253 -37.0
New York Parks 0.5297587 20.0
New Jersey Workplace -0.5228861 -44.0
Connecticut Retail_Recreation -0.5158455 -45.0
Arizona Retail_Recreation -0.5115787 -42.5
Connecticut Residential 0.5077113 14.0
Maine Parks 0.4991376 -31.0
Massachusetts Retail_Recreation -0.4945610 -44.0
New Mexico Grocery_Pharmacy -0.4850504 -11.0
New Jersey Grocery_Pharmacy -0.4850260 2.5
Connecticut Workplace -0.4705395 -39.0
Nebraska Residential -0.4659567 14.0
Arizona Transit 0.4654520 -38.0
New Mexico Residential 0.4583534 13.5
Maryland Workplace -0.4577032 -35.0
Rhode Island Parks 0.4463111 52.0
Montana Parks -0.4412191 -58.0
Missouri Residential -0.4386430 13.0
North Dakota Retail_Recreation -0.4270234 -42.0
Iowa Parks -0.4242621 28.5
Illinois Transit -0.4211731 -31.0
Pennsylvania Workplace -0.4183900 -36.0
West Virginia Parks 0.4171136 -33.0
Kentucky Parks -0.4135970 28.5
Maryland Grocery_Pharmacy -0.4121674 -10.0
New Jersey Retail_Recreation -0.4086278 -62.5
New Jersey Transit -0.4063474 -50.5
Massachusetts Grocery_Pharmacy -0.4060699 -7.0
New Mexico Parks 0.3990230 -31.5
Vermont Residential 0.3955022 11.5
Pennsylvania Retail_Recreation -0.3928510 -45.0
New Hampshire Residential -0.3906960 14.0
Oregon Parks -0.3896059 16.5
Montana Residential 0.3860685 14.0
Alabama Workplace -0.3837415 -29.0
New Mexico Retail_Recreation -0.3781122 -42.5
South Carolina Workplace 0.3755446 -30.0
Michigan Parks 0.3680930 28.5
South Carolina Parks -0.3663328 -23.0
New York Transit -0.3659469 -48.0
Alabama Grocery_Pharmacy -0.3592392 -2.0
Montana Transit 0.3469755 -41.0
Nebraska Grocery_Pharmacy 0.3415338 -0.5
Wisconsin Transit -0.3296085 -23.5
North Dakota Workplace 0.3266216 -40.0
Maryland Retail_Recreation -0.3256385 -39.0
Arkansas Retail_Recreation -0.3213185 -30.0
Virginia Transit -0.3184925 -32.5
Alaska Retail_Recreation 0.3136481 -39.0
Maine Retail_Recreation -0.3136133 -42.0
Florida Residential 0.3052196 14.0
Colorado Residential 0.3051167 14.0
Idaho Workplace -0.3050337 -29.0
California Residential 0.3040292 14.0
Nevada Residential 0.3020367 17.0
Kansas Parks 0.3016630 72.0
California Parks -0.2973788 -38.5
Illinois Workplace -0.2941516 -30.5
Idaho Grocery_Pharmacy -0.2910004 -5.5
California Transit -0.2893836 -42.0
Minnesota Transit -0.2832093 -28.5
Pennsylvania Parks 0.2799563 12.0
Idaho Transit -0.2758650 -30.0
Missouri Workplace 0.2661574 -29.0
West Virginia Grocery_Pharmacy -0.2660821 -6.0
North Carolina Workplace 0.2654490 -31.0
Pennsylvania Grocery_Pharmacy -0.2653576 -6.0
Maryland Residential 0.2642373 15.0
Vermont Workplace -0.2640719 -43.0
Rhode Island Grocery_Pharmacy 0.2621455 -7.5
North Carolina Grocery_Pharmacy 0.2613809 0.0
Vermont Retail_Recreation 0.2609259 -57.0
Kansas Workplace 0.2596677 -32.0
Wyoming Grocery_Pharmacy -0.2585244 -10.0
Alabama Transit -0.2550550 -36.5
Illinois Parks 0.2533417 26.5
Georgia Grocery_Pharmacy -0.2531847 -10.0
Oregon Grocery_Pharmacy -0.2506821 -7.0
Tennessee Workplace -0.2485367 -31.0
Nevada Workplace -0.2481077 -40.0
Rhode Island Transit -0.2460045 -56.0
Wisconsin Parks 0.2387228 51.5
Tennessee Residential 0.2372077 11.5
Missouri Parks 0.2369870 0.0
Minnesota Parks 0.2367037 -9.0
Texas Workplace 0.2360379 -32.0
Florida Parks -0.2356046 -43.0
Washington Grocery_Pharmacy 0.2345481 -7.0
New York Grocery_Pharmacy -0.2278243 8.0
Hawaii Workplace -0.2274053 -46.0
Georgia Workplace -0.2269829 -33.5
South Dakota Workplace 0.2266600 -35.0
Indiana Parks -0.2250577 29.0
Tennessee Parks -0.2240310 10.5
Mississippi Residential 0.2222360 13.0
Arizona Parks -0.2212846 -44.5
Georgia Retail_Recreation -0.2158346 -41.0
Nevada Retail_Recreation -0.2140783 -43.0
North Dakota Grocery_Pharmacy -0.2089604 -8.0
Nebraska Transit -0.2080867 -9.0
Wyoming Workplace -0.2051313 -31.0
Connecticut Parks 0.2007874 43.0
Alabama Parks 0.1992378 -1.0
West Virginia Workplace 0.1991490 -33.0
Texas Residential -0.1967141 15.0
North Carolina Transit 0.1964537 -32.0
Illinois Residential 0.1962095 14.0
Mississippi Grocery_Pharmacy -0.1942328 -8.0
Kansas Grocery_Pharmacy -0.1942182 -14.5
Oregon Residential -0.1886772 10.5
Colorado Parks -0.1881083 2.0
South Dakota Residential 0.1864877 15.0
Wisconsin Residential -0.1817357 14.0
Virginia Residential 0.1807104 14.0
North Carolina Residential 0.1794723 13.0
Oklahoma Parks 0.1785625 -18.5
New Hampshire Retail_Recreation -0.1719158 -41.0
Pennsylvania Transit -0.1696756 -42.0
Ohio Transit 0.1670688 -28.0
Idaho Parks 0.1643543 -22.0
Kentucky Grocery_Pharmacy 0.1633245 4.0
Indiana Residential 0.1629982 12.0
Massachusetts Parks 0.1628029 39.0
Utah Retail_Recreation -0.1619549 -40.0
Kentucky Transit -0.1578391 -31.0
New Jersey Residential 0.1466405 18.0
New Mexico Transit 0.1465065 -38.5
Idaho Retail_Recreation -0.1462595 -39.5
North Dakota Residential -0.1448149 17.0
Texas Parks 0.1447510 -42.0
Kentucky Residential 0.1425660 12.0
Montana Retail_Recreation 0.1423628 -51.0
Mississippi Retail_Recreation -0.1359996 -40.0
Minnesota Retail_Recreation 0.1352388 -40.5
Iowa Transit 0.1306897 -24.0
Virginia Grocery_Pharmacy -0.1301468 -8.0
Oregon Transit 0.1295609 -27.5
Kansas Transit -0.1273557 -23.0
West Virginia Residential -0.1272501 11.0
North Dakota Transit 0.1231319 -48.0
North Carolina Retail_Recreation 0.1231122 -34.0
Wisconsin Grocery_Pharmacy 0.1230843 -1.0
Texas Grocery_Pharmacy 0.1225714 -14.0
Indiana Retail_Recreation 0.1223369 -38.0
Utah Grocery_Pharmacy 0.1223225 -4.0
Wyoming Transit -0.1217514 -17.0
Michigan Workplace -0.1202643 -40.0
Mississippi Parks -0.1172051 -25.0
Montana Workplace -0.1154619 -40.0
California Grocery_Pharmacy -0.1148035 -11.5
Wisconsin Workplace -0.1082624 -31.0
Iowa Workplace 0.1065924 -30.0
Nebraska Retail_Recreation 0.1058510 -36.0
Oklahoma Residential -0.1049674 15.0
Oklahoma Grocery_Pharmacy -0.1047358 -0.5
Missouri Transit -0.1039255 -24.5
Alabama Retail_Recreation 0.1039129 -39.0
Hawaii Residential -0.1038731 19.0
Arkansas Workplace -0.1035676 -26.0
New Hampshire Grocery_Pharmacy -0.1034886 -6.0
Virginia Parks 0.1027960 6.0
Oklahoma Workplace 0.1023492 -31.0
Massachusetts Transit -0.1009664 -45.0
Maryland Transit -0.1003695 -39.0
Wyoming Residential 0.0998429 12.5
Massachusetts Residential 0.0982025 15.0
Minnesota Grocery_Pharmacy 0.0980681 -6.0
South Carolina Transit -0.0974959 -45.0
Michigan Transit 0.0950304 -46.0
Indiana Workplace 0.0941466 -34.0
Iowa Grocery_Pharmacy -0.0937225 4.0
California Retail_Recreation -0.0935755 -44.0
Oregon Retail_Recreation 0.0903275 -40.5
Florida Retail_Recreation 0.0898243 -43.0
New York Residential 0.0896142 17.5
West Virginia Transit -0.0882745 -45.0
Florida Transit -0.0875299 -49.0
Michigan Residential 0.0837502 15.0
Michigan Retail_Recreation -0.0816537 -53.0
Ohio Residential 0.0807802 14.0
Virginia Retail_Recreation -0.0798669 -35.0
Alabama Residential -0.0767367 11.0
Minnesota Residential -0.0751502 17.0
Pennsylvania Residential 0.0735748 15.0
Kentucky Retail_Recreation 0.0701505 -29.0
Ohio Grocery_Pharmacy 0.0700749 0.0
North Carolina Parks -0.0676450 7.0
Texas Retail_Recreation 0.0667887 -40.0
Georgia Parks 0.0636380 -6.0
Washington Transit -0.0635711 -33.5
Washington Retail_Recreation 0.0635628 -42.0
South Dakota Transit -0.0629819 -40.0
Virginia Workplace -0.0619571 -32.0
Georgia Transit -0.0614824 -35.0
Oregon Workplace -0.0611953 -31.0
Mississippi Transit -0.0610678 -38.5
Texas Transit 0.0598542 -41.5
Vermont Transit -0.0586308 -63.0
Tennessee Transit 0.0583869 -32.0
West Virginia Retail_Recreation -0.0582968 -38.5
South Dakota Retail_Recreation -0.0570510 -39.0
Georgia Residential -0.0564561 13.0
Iowa Retail_Recreation 0.0536472 -38.0
Nevada Parks 0.0502160 -12.5
Wyoming Retail_Recreation 0.0472844 -39.0
Arkansas Grocery_Pharmacy -0.0466932 3.0
Indiana Transit 0.0459748 -29.0
South Carolina Residential -0.0450384 12.0
Nebraska Parks 0.0448133 55.5
Ohio Retail_Recreation 0.0438672 -36.0
New Hampshire Workplace 0.0434491 -37.0
Colorado Transit 0.0412601 -36.0
Kentucky Workplace -0.0412304 -36.5
Maine Residential -0.0396122 11.0
Ohio Workplace -0.0388246 -35.0
New Hampshire Transit 0.0386306 -57.0
California Workplace -0.0385168 -36.0
New Mexico Workplace 0.0378586 -34.0
Florida Workplace 0.0374989 -33.0
Arizona Workplace -0.0373195 -35.0
Arkansas Transit -0.0365070 -27.0
Minnesota Workplace -0.0364746 -33.0
Ohio Parks -0.0362487 67.5
Florida Grocery_Pharmacy 0.0350142 -14.0
Illinois Grocery_Pharmacy -0.0341411 2.0
Illinois Retail_Recreation 0.0313491 -40.0
Colorado Retail_Recreation -0.0312952 -44.0
Colorado Grocery_Pharmacy -0.0293102 -17.0
Tennessee Retail_Recreation -0.0285071 -30.0
Montana Grocery_Pharmacy -0.0260919 -16.0
Missouri Grocery_Pharmacy -0.0258582 2.0
Missouri Retail_Recreation -0.0242662 -36.0
Maine Grocery_Pharmacy -0.0214095 -13.0
Oklahoma Transit 0.0171271 -26.0
Oklahoma Retail_Recreation -0.0165233 -31.0
Wisconsin Retail_Recreation 0.0163483 -44.0
South Carolina Retail_Recreation 0.0151640 -35.0
Kansas Residential -0.0149814 13.0
Washington Parks 0.0144744 -3.5
South Dakota Grocery_Pharmacy -0.0132674 -9.0
Colorado Workplace 0.0131787 -39.0
Maryland Parks 0.0123880 27.0
Tennessee Grocery_Pharmacy 0.0098502 6.0
Washington Residential -0.0096590 13.0
Washington Workplace 0.0081303 -38.0
Arkansas Residential -0.0076118 12.0
South Carolina Grocery_Pharmacy 0.0071896 1.0
Indiana Grocery_Pharmacy -0.0071601 -5.5
Nevada Grocery_Pharmacy -0.0069692 -12.5
Mississippi Workplace 0.0051104 -33.0
Iowa Residential 0.0019046 13.0
Michigan Grocery_Pharmacy 0.0011966 -11.0
Kansas Retail_Recreation -0.0001588 -39.0
Alaska Parks NA 29.0
District of Columbia Retail_Recreation NA -69.0
District of Columbia Grocery_Pharmacy NA -28.0
District of Columbia Parks NA -65.0
District of Columbia Transit NA -69.0
District of Columbia Workplace NA -48.0
District of Columbia Residential NA 17.0
# sanity check
ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(x=Total_confirmed_cases.per100,fill=variable))+geom_histogram()+
  facet_grid(~Province.State)+
    default_theme+
  theme(legend.position = "bottom")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

write_plot(mobility.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.plot.png"
write_plot(mobility.global.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.global.plot.png"
(plot_data.permobility_summary.plot<-ggplot(plot_data.permobility_summary,aes(x=variable,y=median_change))+
  geom_jitter(size=2,width=.2)+
  #geom_jitter(data=plot_data.permobility_summary %>% arrange(-abs(median_change)) %>% head(n=15),aes(col=Province.State),size=2,width=.2)+
  default_theme+
  ggtitle("Per-State Median Change in Mobility")+
  xlab("Mobility Measure")+
  ylab("Median Change from Baseline"))

write_plot(plot_data.permobility_summary.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/plot_data.permobility_summary.plot.png"

DELIVERABLE MANIFEST

The following links point to committed documents pushed to github. These are provided as a convenience, but note that this is a manual process: the generation of reports, plots, and tables is not coupled to the execution of this markdown.

Report

This report, html & pdf

Plots

github_root<-"https://github.com/sbs87/coronavirus/blob/master/"

plot_handle<-c("Corona_Cases.world.long.plot",
               "Corona_Cases.world.loglong.plot",
               "Corona_Cases.world.mortality.plot",
               "Corona_Cases.world.casecor.plot",
               "Corona_Cases.city.long.plot",
               "Corona_Cases.city.loglong.plot",
               "Corona_Cases.city.mortality.plot",
               "Corona_Cases.city.casecor.plot",
               "Corona_Cases.city.long.normalized.plot",
               "Corona_Cases.US_state.lm.plot",
               "Corona_Cases.US_state.summary.plot")


deliverable_manifest<-data.frame(
  name=c("World total & death cases, longitudinal",
         "World log total & death cases, longitudinal",
         "World mortality",
         "World total & death cases, correlation",
         "City total & death cases, longitudinal",
         "City log total & death cases, longitudinal",
         "City mortality",
         "City total & death cases, correlation",
         "City population normalized total & death cases, longitudinal",
         "State total cases (select) with linear model, longitudinal",
         "State total cases, longitudinal"),
  plot_handle=plot_handle,
  link=paste0(github_root,"results/",plot_handle,".png")
)


(tmp<-data.frame(row_out=apply(deliverable_manifest,MARGIN = 1,FUN = function(x) paste(x[1],x[2],x[3],sep=" | "))))
##                                                                                                                                                                                                        row_out
## 1                                           World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
## 2                                 World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
## 3                                                         World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
## 4                                      World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
## 5                                              City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
## 6                                    City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
## 7                                                            City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
## 8                                         City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
## 9  City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
## 10                     State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
## 11                                      State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png
row_out<-apply(tmp, 2, paste, collapse="\t\n")
name | handle | link
--- | --- | ---
World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png
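As an aside, the manual `apply()`/`paste()` row-building above can be replaced by `knitr::kable()`, which renders a data frame straight to a markdown table. A minimal sketch; the one-row `manifest` below is a stand-in for the full `deliverable_manifest` built earlier:

```r
library(knitr)

# One-row stand-in for the deliverable manifest constructed above
github_root <- "https://github.com/sbs87/coronavirus/blob/master/"
plot_handle <- "Corona_Cases.world.long.plot"
manifest <- data.frame(
  name        = "World total & death cases, longitudinal",
  plot_handle = plot_handle,
  link        = paste0(github_root, "results/", plot_handle, ".png"),
  stringsAsFactors = FALSE
)

# kable() emits the markdown table directly; no manual paste()/apply() needed
kable(manifest, format = "markdown")
```

Inside an R Markdown chunk, the `kable()` output is picked up by knitr and rendered as a table automatically.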

Tables

CONCLUSION

Overall, the trend of COVID-19 cases is no longer in the log-linear phase for the world or the U.S. (though some regions, like MD, are still log-linear). The mortality rate (deaths / confirmed RNA-based cases) is >1%, with a range depending on region. Mobility is not a strong indicator of caseload (U.S. data).

See table below for detailed breakdown.

Question | Answer
--- | ---
What is the effect of social distancing and decreased mobility on case load? | There is no strong apparent effect of decreased mobility (work, grocery, retail) or increased mobility (at residence, parks) on the number of confirmed cases, either at the country (U.S.) or state level. California appears to have one of the best correlations, but the results are a mixed bag.
What is the trend in cases and mortality across geographical regions? | Confirmed total cases and mortality are overall log-linear for most countries, with a trailing-off beginning for most (including the U.S.). At the state level, NY, NJ, and PA are starting to trail off; MD is still in the log-linear phase. Mortality and case load are highly correlated for NY, NJ, PA, and MD. The mortality rate fluctuates for a given region, but is about 3% overall.
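The mortality figure quoted above is simply deaths divided by confirmed cases, computed per region. A toy illustration of the arithmetic; all numbers below are invented for the example, not taken from the real data (the ~3% overall figure in the text comes from the actual Corona_Cases tables):

```r
# Toy regional counts for illustration only
cases <- data.frame(
  region    = c("NY", "NJ", "MD"),
  confirmed = c(100000, 50000, 20000),
  deaths    = c(3100, 1600, 550)
)

# Mortality rate = deaths / confirmed (RNA-based) cases, per region
cases$mortality <- cases$deaths / cases$confirmed
round(cases$mortality, 4)
```

Note this is a crude case-fatality ratio: it inherits every bias in the confirmed-case denominator (testing volume, reporting lag), which is why the text reports a range rather than a single number.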

END

End: ##—— Tue Jun 23 23:15:17 2020 ——##

Cheatsheet: <http://rmarkdown.rstudio.com>

Sandbox

# Geographical heatmap!
# install.packages("maps")  # run once if the package is not installed
library(maps)
library(dplyr)
library(ggplot2)
mi_counties <- map_data("county", "pennsylvania") %>%  # PA counties, despite the name
  select(lon = long, lat, group, id = subregion)
head(mi_counties)

ggplot(mi_counties, aes(lon, lat)) + 
  geom_point(size = .25, show.legend = FALSE) +
  coord_quickmap()
mi_counties$cases <- seq_len(nrow(mi_counties))  # placeholder "case" counts, one per row
name_overlaps(metadata,Corona_Cases.US_state)

tmp<-merge(Corona_Cases.US_state,metadata)
ggplot(filter(tmp,Province.State=="Pennsylvania"), aes(Long, Lat, group = as.factor(City))) +
  geom_polygon(aes(fill = Total_confirmed_cases), colour = "grey50") + 
  coord_quickmap()


ggplot(Corona_Cases.US_state, aes(Long, Lat))+
  geom_polygon(aes(fill = Total_confirmed_cases ), color = "white")+
  scale_fill_viridis_c(option = "C")
dev.off()
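The polygon attempts above don't fill by region because the case data is never joined to the map polygons. A minimal working choropleth sketch under made-up state counts (`toy_cases` is invented for illustration; the real values live in Corona_Cases.US_state):

```r
library(maps)
library(dplyr)
library(ggplot2)

# State polygons; the 'region' column holds lower-case state names
states <- map_data("state")

# Made-up counts for illustration only
toy_cases <- data.frame(
  region          = c("pennsylvania", "new york", "maryland"),
  total_confirmed = c(60000, 250000, 30000)
)

# Join counts to polygons by state name, then fill each polygon
state_cases <- left_join(states, toy_cases, by = "region")
ggplot(state_cases, aes(long, lat, group = group, fill = total_confirmed)) +
  geom_polygon(colour = "grey50") +
  scale_fill_viridis_c(na.value = "lightgray") +
  coord_quickmap()
```

The key step is the `left_join()` on a shared region key: states without data stay NA and render in the `na.value` color.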


require(maps)
require(viridis)

world_map <- map_data("world")
ggplot(world_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill="lightgray", colour = "white")

head(world_map)
head(Corona_Cases.US_state)
unique(select(world_map, c("region", "group")))

some.eu.countries <- c(
  "US"
)
# Retrieve the map data
some.eu.maps <- map_data("world", region = some.eu.countries)

# Compute the centroid as the mean longitude and latitude
# Used as the label coordinate for country names
region.lab.data <- some.eu.maps %>%
  group_by(region) %>%
  summarise(long = mean(long), lat = mean(lat))

unique(filter(some.eu.maps,subregion %in% Corona_Cases.US_state$Province.State) %>% select(subregion))
unique(Corona_Cases.US_state$Total_confirmed_cases.log)
ggplot(filter(Corona_Cases.US_state, Date=="2020-04-17"), aes(x = Long, y = Lat)) +
  geom_polygon(aes( fill = Total_confirmed_cases.log))+
  #geom_text(aes(label = region), data = region.lab.data,  size = 3, hjust = 0.5)+
  #scale_fill_viridis_d()+
  #theme_void()+
  theme(legend.position = "none")
library("sf")
library("rnaturalearth")
library("rnaturalearthdata")

world <- ne_countries(scale = "medium", returnclass = "sf")
class(world)
ggplot(data = world) +
    geom_sf()

counties <- st_as_sf(map("county", plot = FALSE, fill = TRUE))
counties <- subset(counties, grepl("florida", counties$ID))
counties$area <- as.numeric(st_area(counties))
#install.packages("lwgeom")
class(counties)
head(counties)
ggplot(data = world) +
    geom_sf(data=Corona_Cases.US_state) +
    #geom_sf(data = counties, aes(fill = area)) +
  geom_sf(data = counties, aes(fill = area)) +
   # scale_fill_viridis_c(trans = "sqrt", alpha = .4) +
    coord_sf(xlim = c(-88, -78), ylim = c(24.5, 33), expand = FALSE)


head(counties)
tmp<-unique(select(filter(Corona_Cases.US_state,Date=="2020-04-17"),c(Lat,Long,Total_confirmed_cases.per100)))
st_as_sf(map("county", plot = FALSE, fill = TRUE))

inner_join(Corona_Cases.US_state, counties)  # dplyr join; join::inner_join.sf does not exist

library(sf)
library(sp)

nc <- st_read(system.file("shape/nc.shp", package="sf"))
class(nc)


spdf <- SpatialPointsDataFrame(coords = select(Corona_Cases.US_state,c("Long","Lat")), data = Corona_Cases.US_state,  # coords are (x = longitude, y = latitude)
                               proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

head(spdf)
class(spdf)
st_cast(st_as_sf(spdf))  # convert sp object to sf before casting

filter(Corona_Cases.US_state.summary,Date=="2020-04-20" & Province.State %in% top_states_modified)

https://stevenbsmith.net